Sjogren syndrome microarray data analysis

Topics to be covered:

Data preparation

annot.txt is taken from Single-experiment raw data3//annot3.txt

data.txt are combined from 01.Data.Analysis.xlsx and 11.Data.xlsx.

meta1.txt is taken from 11 Annotations final cohorts sheet 2 20DEC13 DF.xlsx and contains cohort information. The data has been transposed to have patients names as columns for compatibility with data.txt

meta2.txt is taken from 20140214FarrisMicroarraymgdb(1).xlsx (accessed on 03-13-2014). The data has been transposed. ID p1033680-6 has been renamed to p1033680-6 (-5) to be compatible with the data.txt header.

To identify the largest source of variability within the data, we perform principal component analysis and color the samples by Cohort/Treatment status.

Clearly, the cohorts are very different. The two patients, p1033216.2 and p1033680.6…5., appear as outliers. They are also picked up by the arrayQualityMetrics set of tests.

We remove them and look at the principal components again.

The data looks more homogeneous now, with the cohort effect still dominating.

To further investigate how the outliers affect differential gene expression, we will perform differential expression limma analysis on the expression data with and without outlier.

Differential expression with outliers

Only 41 probes are differentially expressed out of 62976 total.

Differential expression without outliers

Removing the outliers improved sensitivity in detecting differentially expressed genes, 169 DEGs.

Using ComBat to account for batch effect

To investigate batch effect in raw data, we will test the differences between the cohorts of patients, without considering their treatment status.

The outliers were removed.

Almost 1/3 of the probes are differentially expressed between cohorts, 19562 DEGs out of 62976 total.

We use ComBat to manually adjust for the cohort effect and keep the treatment effect. If we look at the PCA plot after the adjustment, the data look more homogeneous, and the cohorts are now mixed together.

## Found 2 batches
## Adjusting for 1 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

And no differentially expressed genes between the cohorts can be detected after removing the batch effect, 0 DEGs.

Differential expression analysis

So far we used the complete dataset with 62976 probes. However, probes constantly expressed across the conditions are less interesting to us. To increase power of detecting differentially expressed genes, we remove such constantly expressed genes.

Almost half of the probes were removed, keeping 31488 probes.

After removing batch effect, we get more differentially expressed genes, 784 DEGs.

Let’s have a look at top 50 differentially expressed genes. Multiple probes for the same gene are collapsed to one by maximum expression level.

We aggregate probe names summarizing expression of the same gene by maximum fold change. The list of all differentially expressed genes, and the full statistical output, are in the “results/Cohort_DEGs_Enrichment.xlsx” file.

The results of functional enrichment analysis of these differentially expressed genes are in the same “results/Cohort_DEGs_Enrichment.xlsx” file.

Legend: “GO” - gene ontology, with focus on “MF”/“BP”/“CC” - molecular functions/biological processes/cellular components functions; “KEGG” - canonical pathways.

Note: For any KEGG pathway, we can visualize it as a figure, and overlay fold changes of differentially expressed genes

Genes best correlating with clinical parameters

Interesting clinical parameters are: WUSFvol, SSFvolL, SSFvolR, LGleft, LGright, FS, LaBioRadvalue, RoBioRadvalue, RFvalue. For each clinical parameter, we answer 3 questions:

  1. What are the top 20 genes best correlated with a clinical parameter?
  2. Are those genes enriched in canonical pathways? Which? In how many?
  3. Are those genes enriched in “biological process” gene ontologies? Which? In how many?

We can look at correlations in multiple ways:

  1. Clinical parameter with all genes, e.g., clinCorrel(meta$WUSFvol, , combat_edata)
  2. Clinical parameter with DEGs only, e.g., clinCorrel(meta$WUSFvol, , combat_edata[rownames(res), ])
  3. Current. Clinical parameter with DEGs only in PSS group, e.g.,

Note, GO and KEGG enrichment analyses are done using genes correlating with clinical parameters with p-value < 0.05.

WUSFvol

SSFvolL

SSFvolR

LGleft

LGright

FS

LaBioRadvalue

RoBioRadvalue

RFvalue

Genes best correlating with the flow counts

We first check how the flow measurements correlate with each other. Numbers in each cell indicate Pearson’s correlation coefficient. It is expected that a measurement will correlate with itself with the coefficient equal to “1”.

Things to watch for: Some different measurements show perfect correlation of “1”. Should it be so?

%CD4+RA-(of CD3+)

%CD8+RA-(of CD3+)

%CD4-8-(of CD3+)

ratio.of.pctCD4plusRAminusperpct.CD8plusRAminus

pctCD38hi27hi.of.TN.